Predicting the Acceptance or Rejection of Personal Loans using Machine Learning Classification Algorithms.

Introduction

The dataset pertains to a fictitious bank with a growing customer base. The bank wants to expand its loan business and earn interest on those loans. In particular, management seeks to target its liability customers (depositors) with personal loan offers. A campaign the bank ran last year achieved a healthy conversion rate of over 9%. To reduce campaign costs, the marketing department wants to reach these liability customers on a minimal budget, and it wants to know whether a machine learning model can better predict which customers have a higher probability of purchasing the loan.

Objective

The objective of this project is to predict whether a customer will purchase a personal loan and to determine which machine learning algorithm provides the best accuracy.

Motivation

My motivation for pursuing this project is to learn how machine learning algorithms can be useful in financial institutions.

Data source

The data source used in this project is Kaggle. The dataset is available at https://www.kaggle.com/itsmesunil/bank-loan-modelling.

Load all of the essential libraries.

The dataset was loaded into a Jupyter Notebook from an Excel file:
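The notebook cell itself isn't shown here, but a minimal sketch of this step, assuming pandas' `read_excel` was used, might look as follows. The filename and sheet name are taken from the Kaggle page and may differ from a local copy; to keep the sketch self-contained, a tiny stand-in frame with a few of the dataset's columns is built instead:

```python
import pandas as pd

# Assumed filename/sheet from the Kaggle dataset; adjust to your local copy:
# df = pd.read_excel("Bank_Personal_Loan_Modelling.xlsx", sheet_name="Data")

# Self-contained stand-in with a few of the dataset's columns:
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Income": [49, 34, 11],
    "ZIP Code": [91107, 90089, 94720],
    "Personal Loan": [0, 0, 0],
})
print(df.shape)  # (rows, columns)
```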

Feature Engineering

Feature engineering was performed to detect outliers, NA values, and null values in the dataset. Data tidying was also performed to get better performance from the models.

After removing unnecessary features such as ID and ZIP Code, the remaining features are shown in the data table above.

The new dimension of the dataset is 5000 × 12.

There are no null values in the dataset, as can be seen in the heatmap above.

There are no na values in the dataset, as evidenced by the heatmap above.

After removing outliers from the original dataset, 3477 observations and 12 features were obtained.
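The notebook's exact cleaning code isn't reproduced here, but one common approach consistent with the description is the 1.5 × IQR rule. The sketch below uses a synthetic right-skewed Income column as a stand-in for the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in: one right-skewed numeric column like Income.
df = pd.DataFrame({"Income": rng.lognormal(mean=4.0, sigma=0.5, size=1000)})

# Null/NA check (the heatmaps in the notebook visualize the same result).
print("nulls:", df.isnull().sum().sum())

# Drop observations outside 1.5 * IQR, a common outlier rule of thumb.
q1, q3 = df["Income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["Income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]
print(len(df), "->", len(df_clean))
```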

The distribution and box plot of the Income variable have improved significantly, which should help the machine learning models perform better.

After excluding outliers, the summary statistics of all variables improved markedly.

The heatmaps give a true picture of the dataset with and without outliers.

Exploratory Data Analysis (EDA)

I explored the data using visualization libraries such as seaborn, matplotlib, plotly, cufflinks, and Plotly Express. This section gives a thorough tour of the features in the dataset.

A pair plot gives a faithful depiction of the significant variables and shows how they relate to one another. The color of each data point indicates the customer's education level.

The correlation matrix plot shows whether each pair of variables has a positive or negative relationship and how strong it is. It is an accurate summary of the relationships between variables and an extensively used statistical tool in practice.
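A correlation heatmap of this kind can be reproduced with seaborn. The frame below is synthetic (Income and CCAvg are correlated by construction), since the real data isn't bundled with this write-up:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
income = rng.normal(70, 20, 500)
df = pd.DataFrame({
    "Income": income,
    "CCAvg": 0.02 * income + rng.normal(0, 0.5, 500),  # correlated by design
    "Age": rng.integers(23, 67, 500),
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```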

The dendrograms along the top of the hierarchical-clustering map demonstrate the similarity among the variables, and a color bar indicates the scale of each variable's values.

The violin plot indicates the kernel density estimation of the underlying distribution. Moreover, it depicts more information than a box plot. It also demonstrates the median of the average spending of the customers on a credit card per month.

A box plot created with the Plotly Express library displays the individual data points alongside the box-and-whisker summary, split by personal loan status and grouped by family size. Among customers who accepted a personal loan, the highest median value belonged to families with two members.

After filtering the type of personal loan, the underlying distribution of customers who accepted the loan is much smaller than the underlying distribution of customers who did not accept the loan, as shown in the distribution plot.

The marginal histogram shows the underlying distribution and kernel density estimation. It also depicts a scatter plot with a positive linear relationship: higher income corresponds to higher monthly credit card expenditure.

A facet grid is another way to glance at the data. I utilized the facet grid to display more information about customers who owned a credit card, used internet banking, and had a sufficient income. The customer's income is represented on the x-axis. On the other hand, the kernel density estimation (KDE) presents a smooth density plot where all the specified conditions were satisfied.

The count plot is a powerful tool for exploring the bulk of the data. Using count plots, I could see the differences between several variables at a glance, and they can be read even by someone who isn't a statistician. As a result, the count plots gave me a quick overview of the key variables in the dataset.

The following supervised machine learning algorithms are used in this project: logistic regression, k-nearest neighbors (KNN), decision tree, random forest, support vector machine (SVM), and artificial neural network (ANN).

1. Logistic Regression

The classification report and the confusion matrix demonstrated the logistic model's remarkable performance. The classification report included four key evaluation metrics: precision, recall, F1-score, and accuracy, and several of these indicated that the overall outcome was impressive. The confusion matrix indicates the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A false negative (a Type II error) is a customer who purchased a personal loan but was predicted not to; a false positive (a Type I error) is a customer who did not purchase a loan but was predicted to. Together, the FP and FN counts make up the misclassifications. The confusion matrix thus reveals the true picture of actual versus predicted values.
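The fitted model isn't reproduced here, but the evaluation workflow can be sketched with scikit-learn on a synthetic, imbalanced stand-in dataset; the sensitivity and specificity formulas match the manual computation described below:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: imbalanced binary labels like the personal loan column.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(classification_report(y_te, pred))
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```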

The binary classification graph places the data points at 0 and 1, colored by personal loan status, with age on the x-axis. A cutoff of 0.5 is drawn inside the plot.

Accuracy indicates how close a predicted value is to the true value. Therefore, a logistic model that showed 98% accuracy is quite impressive. I manually computed the sensitivity and specificity as well. The low sensitivity indicates many false negatives. On the other hand, the high specificity shows few false positives.

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate. The area under the curve (AUC) indicates excellent discriminatory ability: the higher the AUC, the better the model.
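A sketch of how such a curve is produced with scikit-learn, again on a synthetic stand-in dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_curve.png")
```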

The error rate compares the actual values against the predicted values; simply put, it is the proportion of predictions that were wrong. The logistic model's error rate is below 5%, which bodes well for the generalization of the results.

2. KNN (K-nearest neighbors algorithm)

The classification report showed good performance with k = 1. The KNN model's accuracy is excellent, and precision, recall, and F1-score all improved significantly compared to the logistic model. The confusion matrix showed only 24 misclassified observations in total. Without a doubt, this result is superior to the logistic model.

The elbow method showed that, for this dataset, the error rate rises as k increases: the more neighbors a model considers, the more (and more distant) points are averaged into each prediction.
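The elbow search can be sketched as a loop over candidate k values, tracking the test-set error rate for each (synthetic data again):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Error rate (share of wrong predictions) for k = 1..20.
error_rate = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    error_rate.append(np.mean(knn.predict(X_te) != y_te))

best_k = int(np.argmin(error_rate)) + 1  # +1 because k starts at 1
print("best k:", best_k)
```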

With the optimal value of k, the misclassified observations were reduced to 23, and the true positive and true negative counts increased slightly.

The KNN algorithm correctly classified the 0 and 1 personal loan classes. The number 0 indicates customers who didn't purchase the personal loan, whereas 1 indicates customers who did. The scatter plot depicts loan status against income and the customers' average monthly credit card spending.

As can be seen, the class distribution is imbalanced, which may affect the model's performance.

The error rate did not change because the accuracy did not change; the other evaluation metrics and the misclassification count, however, did.

3. Decision Tree Algorithm

The decision tree model reduced the misclassified observations to 21 compared with the k-nearest neighbors model. Other important evaluation metrics, such as recall and F1-score, improved significantly, while precision fell by 0.07. Accuracy was unaffected.

Compared to the logistic and KNN models, the error rate is slightly lower, making the decision tree the most effective model so far.

The decision tree is a very useful tool for making decisions. Its root node, at the top, splits on the Family variable; below it are decision nodes and terminal nodes, also known as leaves. The tree classifies the personal loan data and is easy to understand even for someone who isn't a statistician.
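A tree like the one described can be fitted and drawn with scikit-learn's `plot_tree`; the feature and class names below are assumptions modeled on the dataset's columns, and the data is synthetic:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Assumed feature/class names for illustration only.
plot_tree(tree, feature_names=["Family", "Income", "CCAvg", "Education"],
          class_names=["No loan", "Loan"], filled=True)
plt.savefig("decision_tree.png")
```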

4. Random Forest Algorithm

The random forest model is superior to the decision tree model: precision, the misclassification count, and the F1-score are all somewhat better, and accuracy improved marginally.

The error rate falls by 0.383%, so the random forest model has truly excelled.

Random forest builds numerous decision trees by selecting observations (rows) and features (variables) at random and then averaging the results. The tree shown has the root node Family at the top, followed by decision nodes and terminal nodes, and it is more readable and interpretable than a complex single decision tree.
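A minimal random forest sketch; `n_estimators` and the synthetic data are illustrative assumptions, not the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree sees a bootstrap sample of rows and a random subset of features.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
print(f"accuracy: {acc:.3f}")
print("feature importances:", rf.feature_importances_.round(3))
```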

5. Support Vector Machine (SVM) Algorithm

As can be seen, the SVM initially predicted only one class of personal loan. We must first choose the best C (the regularization/tuning parameter) so that both classes are taken into account.

Using grid search cross-validation, the model correctly predicted both classes of personal loans. Although I used the optimal C and gamma tuning parameters, the result is not nearly as spectacular as the random forest.
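The grid-search step can be sketched as follows; the C/gamma grid is a typical assumption, not necessarily the one used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Imbalanced synthetic stand-in, like the personal loan labels.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X_tr, y_tr)
print("best params:", grid.best_params_)
pred = grid.predict(X_te)
```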

We can display the support vectors and the training set by simply using the Matplotlib library to visualize the training data and stacking the support vectors on top.

6. Artificial Neural Network (ANN)

The artificial neural network (ANN) was trained with stochastic gradient descent.
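The original notebook likely used a deep-learning framework such as Keras; as a self-contained stand-in, this sketch trains scikit-learn's `MLPClassifier` with the SGD solver on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling matters for gradient-based training.
scaler = StandardScaler().fit(X_tr)
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), solver="sgd",
                    learning_rate_init=0.05, max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_tr), y_tr)
acc = mlp.score(scaler.transform(X_te), y_te)
print(f"test accuracy: {acc:.3f}")
```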

The true depiction of running an artificial neural network (ANN) model is shown in the validation accuracy & loss and training accuracy & loss plot. Since the training and validation losses have dropped, this is a positive sign for model evaluation.

After carefully analyzing the classification report and the confusion matrix, an artificial neural network is not as remarkable as the random forest. Also, it revealed a large number of misclassified values which makes it an inferior model in comparison to other supervised machine learning models.

In contrast to all other models, the artificial neural network had an extremely high error rate. This is not something the retail marketing department should utilize.

I showed a summary table of the accuracy and error rates of the machine learning algorithms deployed in this project. As shown above, random forest is the best machine learning algorithm.
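The table itself isn't reproduced here; the sketch below shows how such a comparison table can be assembled with pandas. The accuracy figures are placeholder values for illustration only, not the project's actual results (apart from reflecting the conclusion that random forest scored highest):

```python
import pandas as pd

# Placeholder accuracies for illustration; substitute each model's real score.
summary = pd.DataFrame({
    "Model": ["Logistic Regression", "KNN", "Decision Tree",
              "Random Forest", "SVM", "ANN"],
    "Accuracy": [0.98, 0.98, 0.98, 0.99, 0.97, 0.93],
})
summary["Error rate"] = 1 - summary["Accuracy"]
best = summary.loc[summary["Accuracy"].idxmax(), "Model"]
print(summary.sort_values("Accuracy", ascending=False).to_string(index=False))
print("best model:", best)
```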

Conclusion

After a thorough investigation, feature engineering, and exploration of the dataset variables, some of the most prominent supervised machine learning algorithms were deployed in this project. The objective of utilizing these algorithms was to find the optimum one for helping the retail marketing department identify potential customers who are more likely to purchase the loan. After testing them, I found that the random forest algorithm is the best fit for this dataset. The bank should apply the random forest model, which will help the retail marketing department cut campaign costs. Last but not least, the random forest can accurately predict whether potential customers will accept or decline a personal loan in future campaigns.